Take-Home Exercise 3 - Geographically Weighted Logistic Regression (GWLR) and Application

1. Overview

In this exercise, I learn the basic concepts and methods of logistic regression designed specifically for geographical data. Upon completion of this exercise, I will be able to:

explain the similarities and differences between the Logistic Regression (LR) algorithm and the Geographically Weighted Logistic Regression (GWLR) algorithm.
calibrate predictive models by using an appropriate Geographically Weighted Logistic Regression algorithm for geographical data.
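
The contrast between the two algorithms can be sketched in symbols. The notation is introduced here for illustration and is not taken from the exercise: p_i is the probability that water point i is functional, x_ik is the value of the k-th independent variable at that point, and (u_i, v_i) are its coordinates.

```latex
% Global logistic regression: one coefficient vector for the whole study area
\operatorname{logit}(p_i) = \ln\frac{p_i}{1 - p_i} = \beta_0 + \sum_{k=1}^{m} \beta_k x_{ik}

% GWLR: every coefficient is a function of the location (u_i, v_i) of observation i
\operatorname{logit}(p_i) = \ln\frac{p_i}{1 - p_i} = \beta_0(u_i, v_i) + \sum_{k=1}^{m} \beta_k(u_i, v_i)\, x_{ik}
```

Both models share the logit link and the same set of variables. They differ only in whether the coefficients are allowed to vary over space.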

1.1 Overall Goal

To build an explanatory model to discover the factors affecting water point status in Osun State, Nigeria.

Study area: Osun State, Nigeria

1.2 Model Variables

Dependent variable: Water point status (i.e. functional / non-functional)

Independent variables:

distance_to_primary_road
distance_to_secondary_road
distance_to_tertiary_road
distance_to_city
distance_to_town
water_point_population
local_population_1km
usage_capacity
is_urban
water_source_clean

2. Setup

2.1 Packages Used

The R packages that we will be using for this analysis are:

sf: used for importing, managing, and processing geospatial data
spdep: used for computing spatial weights, global and local spatial auto-correlation statistics
tidyverse: used for wrangling attribute data
tmap: used for creating cartographic quality choropleth maps
corrplot, ggpubr: used for multivariate data visualization and analysis
funModeling: used for exploratory data analysis, data preparation and model performance

In addition, the following tidyverse packages will be used:

readr for reading rectangular data from csv, tsv and fwf
tidyr for manipulating and tidying data
dplyr for wrangling and transforming data
ggplot2 for visualising data

2.2 Datasets Used

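For this exercise, two data sets saved in RDS format will be used: Osun.rds, which holds the boundary polygons of Osun State, and Osun_wp_sf.rds, which holds the water points and their attributes.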

2.3 Launching the packages in R

The code chunk below is used to perform the following tasks:

creating a package list containing the necessary R packages,
checking if the R packages in the package list have been installed in R,
- if they have yet to be installed, the missing packages will be installed automatically,
launching the packages into R environment.

pacman::p_load(sf, tidyverse, blorr, corrplot, ggpubr, spdep, GWmodel, tmap, skimr, caret, funModeling)
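
For readers who want to see what the check, install and load steps amount to, the following is a minimal base-R sketch. The helper name load_or_install is hypothetical and is not part of pacman. The demonstration deliberately uses packages that ship with R so that nothing is downloaded.

```r
# Hypothetical helper: a base-R sketch of the check-install-load steps listed above.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    # requireNamespace() returns FALSE (rather than raising an error) if pkg is missing
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # Attach the package to the search path
    library(pkg, character.only = TRUE)
  }
  # Report which of the requested packages are now attached
  pkgs %in% .packages()
}

# Demonstrated with packages bundled with R, so no installation is triggered
loaded <- load_or_install(c("stats", "utils"))
print(loaded)
```

In practice the single p_load() call above is shorter and does the same job for the full package list.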

2.4 Importing the Analytical Data

Osun <- read_rds("rds/Osun.rds")
Osun_wp_sf <- read_rds("rds/Osun_wp_sf.rds")

Osun_wp_sf %>%
  freq(input = "status")

tmap_mode("view")
tm_shape(Osun) + 
  tm_polygons(alpha = 0.4) + 
  tm_shape(Osun_wp_sf) + 
  tm_dots(col = "status",
          alpha = 0.6) +
  tm_view(set.zoom.limits = c(9,12))

3. Summary Statistics with skimr

The skim() function summarises each variable in tabular form, including its number of missing values, so we can check the quality of the data set. This also helps in selecting the independent variables.

Osun_wp_sf %>%
  skim()

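The code chunk below drops records with missing values in any of the model variables and uses as.factor() to convert usage_capacity from a numerical variable to a categorical variable (i.e. a factor).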

Osun_wp_sf_clean <- Osun_wp_sf %>%
  filter_at(vars(status,
                  distance_to_primary_road,
                  distance_to_secondary_road,
                  distance_to_tertiary_road,
                  distance_to_city,
                  distance_to_town,
                  water_point_population,
                  local_population_1km,
                  usage_capacity,
                  is_urban,
                  water_source_clean),
             all_vars(!is.na(.))) %>%
  mutate(usage_capacity = as.factor(usage_capacity))
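
To see why the conversion matters, the sketch below compares the design matrix columns that a model receives in each case. The capacity values are made up for illustration and are not read from the Osun data.

```r
# Synthetic values standing in for usage_capacity (assumed, for illustration only)
capacity <- c(250, 300, 1000, 300, 1000, 250)

# Treated as a number: a single column, i.e. one slope per unit of capacity
numeric_design <- model.matrix(~ capacity)

# Treated as a factor: one dummy column per level, excluding the reference level
factor_design <- model.matrix(~ as.factor(capacity))

colnames(numeric_design)
colnames(factor_design)
```

With a factor, each capacity class gets its own coefficient relative to the reference class, instead of the model assuming a constant effect per unit of capacity.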

3.1 Correlation Analysis

Osun_wp <- Osun_wp_sf_clean %>%
  select(c(7,35:39,42:43, 46:47,57)) %>%
  st_set_geometry(NULL)

cluster_vars.cor <- cor(Osun_wp[, 2:7])
corrplot.mixed(cluster_vars.cor,
               lower = "ellipse",
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")
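
Besides reading the plot, highly correlated pairs can be listed programmatically. The sketch below uses synthetic data rather than Osun_wp, and the cutoff of 0.8 is an assumption chosen for illustration, not a value stated in this exercise.

```r
set.seed(1)
# Synthetic stand-ins for numeric predictors (not the Osun data)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # built to be almost collinear with x1
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| at or above the assumed cutoff
threshold <- 0.8
high <- which(abs(cor_mat) >= threshold & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
high_pairs
```

Any pair listed this way is a candidate for dropping one of its two variables before the regression model is fitted.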

3.2 Fitting the Logistic Regression Model

model <- glm(status ~ distance_to_primary_road +
              distance_to_secondary_road +
              distance_to_tertiary_road + 
              distance_to_city +
              distance_to_town +
              is_urban + 
              usage_capacity + 
              water_source_clean +
              water_point_population +
              local_population_1km,
              data = Osun_wp_sf_clean,
              family = binomial(link = "logit"))

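The code chunk below uses blr_regress() of the blorr package to present the fitted model as a report, and blr_confusion_matrix() to tabulate its classification performance at a cutoff of 0.5.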

blr_regress(model)

blr_confusion_matrix(model, cutoff = 0.5)
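
The logic behind a confusion matrix at a given cutoff can be sketched in base R as follows. The data are synthetic and the variable names are illustrative. This is not how blorr computes its report internally, only the same idea worked by hand.

```r
set.seed(42)
# Synthetic binary outcome driven by one predictor (not the Osun data)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * x)) == 1

fit <- glm(y ~ x, family = binomial(link = "logit"))

# Fitted probabilities, then a hard classification at the cutoff
prob <- fitted(fit)
pred <- prob >= 0.5

# Rows: predicted class, columns: observed class
cm <- table(
  predicted = factor(pred, levels = c(FALSE, TRUE)),
  observed  = factor(y,    levels = c(FALSE, TRUE))
)

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["TRUE", "TRUE"]   / sum(cm[, "TRUE"])    # true positive rate
specificity <- cm["FALSE", "FALSE"] / sum(cm[, "FALSE"])   # true negative rate

cm
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

Raising or lowering the cutoff trades sensitivity against specificity, so 0.5 is a convention rather than a requirement.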

Osun_wp_sp <- Osun_wp_sf_clean %>%
  select(c(status,
           distance_to_primary_road,
           distance_to_secondary_road,
           distance_to_tertiary_road,
           water_point_population,
           local_population_1km,
           distance_to_city,
           distance_to_town,
           is_urban,
           usage_capacity, 
           water_source_clean)) %>%
  as_Spatial()

Osun_wp_sp

bw.fixed <- bw.ggwr(status ~
                      distance_to_primary_road +
                      distance_to_secondary_road +
                      distance_to_city +
                      distance_to_town +
                      water_point_population +
                      local_population_1km +
                      is_urban +
                      usage_capacity +
                      water_source_clean,
                    data = Osun_wp_sp,
                    family = "binomial",
                    approach = "AIC",
                    kernel = "gaussian",
                    adaptive = FALSE,
                    longlat = FALSE)

bw.fixed

gwlr.fixed <- ggwr.basic(status ~
                        distance_to_primary_road +
                        distance_to_secondary_road +
                        distance_to_city +
                        distance_to_town +
                        water_point_population +
                        local_population_1km +
                        is_urban +
                        usage_capacity +
                        water_source_clean,
                        data = Osun_wp_sp,
                        bw = bw.fixed,
                        family = "binomial",
                        kernel = "gaussian",
                        adaptive = FALSE,
                        longlat = FALSE)

gwlr.fixed
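
To make the role of the fixed bandwidth and the Gaussian kernel more concrete, the sketch below weights observations by their distance from one regression point and fits a weighted logistic regression there. It is a simplified illustration on synthetic data. The bandwidth of 2500 m, the variable names and the use of a weighted glm() are assumptions, and GWmodel's own fitting routine may differ in detail.

```r
set.seed(7)
# Synthetic points on a 10 km x 10 km square (coordinates in metres, assumed projected)
n <- 300
coords <- cbind(x = runif(n, 0, 10000), y = runif(n, 0, 10000))
dist_road <- runif(n, 0, 5000)

# The effect of dist_road changes from west to east, i.e. it is non-stationary
slope <- -0.001 + 0.0002 * coords[, "x"] / 1000
status <- rbinom(n, 1, plogis(1 + slope * dist_road))

# Gaussian kernel: weight 1 at the regression point, decaying with distance
gauss_weight <- function(d, bw) exp(-0.5 * (d / bw)^2)

# Weighted logistic fit centred on regression point i
local_fit <- function(i, bw) {
  d <- sqrt((coords[, "x"] - coords[i, "x"])^2 + (coords[, "y"] - coords[i, "y"])^2)
  w <- gauss_weight(d, bw)
  # quasibinomial yields the same coefficients as binomial but does not warn
  # about non-integer weights
  coef(glm(status ~ dist_road, family = quasibinomial(link = "logit"), weights = w))
}

bw <- 2500  # hypothetical fixed bandwidth in metres
west <- which.min(coords[, "x"])
east <- which.max(coords[, "x"])
local_coefs <- rbind(west = local_fit(west, bw), east = local_fit(east, bw))
local_coefs
```

Repeating the local fit at every observation gives one set of coefficients per location, which is the kind of output stored in the SDF component of the fitted GWLR object.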

gwr.fixed <- as.data.frame(gwlr.fixed$SDF)

# Classify a water point as TRUE when its fitted probability is at least 0.5
gwr.fixed <- gwr.fixed %>%
  mutate(most = yhat >= 0.5)

gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data = gwr.fixed$most, reference = gwr.fixed$y)

CM

3.3 Plotting Geographically Weighted Logistic Regression Model

Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
  select(c(ADM2_EN, ADM2_PCODE,
           ADM1_EN, ADM1_PCODE,
           status))

gwr_sf.fixed <- cbind(Osun_wp_sf_selected, gwr.fixed)

tmap_mode("view")
prob_T <- tm_shape(Osun) +
  tm_polygons(alpha = 0.1) +
  tm_shape(gwr_sf.fixed) +
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(9, 14))

prob_T

4. Re-running the Logistic Regression Model without Insignificant Variables

model_rerun <- glm(status ~ distance_to_primary_road +
              distance_to_city +
              distance_to_town +
              is_urban + 
              usage_capacity + 
              water_source_clean +
              water_point_population +
              local_population_1km,
              data = Osun_wp_sf_clean,
              family = binomial(link = "logit"))

blr_regress(model_rerun)

blr_confusion_matrix(model_rerun, cutoff = 0.5)
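
One way to screen variables by significance is to read the p-values from the coefficient table of a fitted glm object. The sketch below does this on synthetic data. The 0.05 significance level and all variable names are assumptions made for illustration.

```r
set.seed(123)
# Synthetic data: x_signal affects the outcome, x_noise does not (not the Osun data)
n <- 1000
x_signal <- rnorm(n)
x_noise  <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.2 + 1.2 * x_signal))

fit <- glm(y ~ x_signal + x_noise, family = binomial(link = "logit"))

# Coefficient table: estimate, standard error, z value and Pr(>|z|)
coef_table <- summary(fit)$coefficients
p_values <- coef_table[-1, "Pr(>|z|)"]   # drop the intercept row

alpha <- 0.05  # assumed significance level
keep <- names(p_values)[p_values < alpha]
drop <- names(p_values)[p_values >= alpha]

# Refit with the retained terms only
fit_reduced <- glm(reformulate(keep, response = "y"), family = binomial(link = "logit"))
list(kept = keep, dropped = drop)
```

For a factor such as usage_capacity the coefficient table has one row per non-reference level, so the row names would need to be mapped back to the original variable before rebuilding the formula.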

Osun_wp_sp_rerun <- Osun_wp_sf_clean %>%
  select(c(status,
           distance_to_primary_road,
           water_point_population,
           local_population_1km,
           distance_to_city,
           distance_to_town,
           is_urban,
           usage_capacity, 
           water_source_clean)) %>%
  as_Spatial()

Osun_wp_sp_rerun

bw.fixed_rerun <- bw.ggwr(status ~
                      distance_to_primary_road +
                      distance_to_town +
                      water_point_population +
                      local_population_1km +
                      is_urban +
                      usage_capacity +
                      water_source_clean,
                    data = Osun_wp_sp_rerun,
                    family = "binomial",
                    approach = "AIC",
                    kernel = "gaussian",
                    adaptive = FALSE,
                    longlat = FALSE)

bw.fixed_rerun

gwlr.fixed_rerun <- ggwr.basic(status ~
                        distance_to_primary_road +
                        distance_to_town +
                        water_point_population +
                        local_population_1km +
                        is_urban +
                        usage_capacity +
                        water_source_clean,
                        data = Osun_wp_sp_rerun,
                        bw = bw.fixed_rerun,
                        family = "binomial",
                        kernel = "gaussian",
                        adaptive = FALSE,
                        longlat = FALSE)

gwlr.fixed_rerun

gwr.fixed_rerun <- as.data.frame(gwlr.fixed_rerun$SDF)

# Classify a water point as TRUE when its fitted probability is at least 0.5
gwr.fixed_rerun <- gwr.fixed_rerun %>%
  mutate(most = yhat >= 0.5)

gwr.fixed_rerun$y <- as.factor(gwr.fixed_rerun$y)
gwr.fixed_rerun$most <- as.factor(gwr.fixed_rerun$most)
CM_rerun <- confusionMatrix(data = gwr.fixed_rerun$most, reference = gwr.fixed_rerun$y)

CM_rerun

5. Conclusion

In hindsight, geographically weighted logistic regression (GWLR) can be more accurate than global logistic regression when the effect of some variables on water point status varies with geographical location. In such cases, a spatially non-stationary regression model is needed.